Video/Audio Transcription and Summarization

OJT Project Report and Demonstration

Omkar Ninav
M.Sc. Statistics

Under the guidance of
Dr. Dinesh Helwade

2025-07-01

Preface

This On-the-Job Training (OJT) project provided an invaluable opportunity to apply theoretical knowledge of natural language processing, deep learning, and data engineering to a real-world application. The primary objective was to build a pipeline for automatic transcription and summarization of educational and conference video content, with the aim of making such content more accessible and searchable.

I would like to express my sincere gratitude to Dr. Dinesh Helwade for his guidance and encouragement throughout the project. It was under his initiative that this project was designed with a vision to open-source transcription and summarization tools for the benefit of educational institutions and research communities.

This project marks a critical step in combining machine learning technology with accessibility goals in education.

Introduction/Problem Statement

Video has become one of the most widely used media for disseminating information, especially in academia, conferences, and online education. However, extracting insights from long-form videos, or reviewing their content, remains time-consuming and inefficient.

Key Challenges

  • Manual note-taking is tedious and error-prone
  • Searchability within videos is limited
  • Lack of concise summaries for quick reference
  • Existing transcription tools are often paid or closed-source

About the Organization

Pradnyaa InfoVision, headquartered in Pune, India, is a specialized analytics and consulting firm with over four years of experience delivering data-driven solutions. The company operates across two core domains: Retail Analytics and Biostatistics, offering deep domain expertise and tailored consulting to global clients.

🛒 Retail Analytics Division

The Retail Analytics team at Pradnyaa InfoVision specializes in demand forecasting, using both standard statistical models and machine learning algorithms. Its capabilities include:

  • Regular Price Optimization

  • Markdown Strategy & Optimization

  • Inventory Optimization

One of the firm’s flagship projects involved partnering with a major U.S. retailer to design and execute a comprehensive price test across the entire U.S. region, demonstrating the company’s global reach and strategic insight.

🧪 Biostatistics & Clinical Trial Analytics

In the life sciences space, Pradnyaa InfoVision plays a key role in the analysis and processing of all three phases of clinical trials. The Biostatistics division supports pharmaceutical and healthcare companies by providing end-to-end statistical solutions that comply with global regulatory standards.

🌍 Offshore Delivery and Managed Services

Beyond analytics, the company also provides strategic consulting for offshore office setup and management in India. This includes team recruitment, operational oversight, and seamless integration with client business processes, enabling clients to establish a strong and scalable presence in India.

Project Objectives

  • Automate Speech-to-Text Conversion
  • Implement Robust Summarization
  • Design an Interactive Web Interface
  • Support Long Audio/Video Files
  • Ensure Local and Scalable Deployment
  • Improve Accessibility and Productivity

Pipeline

Tools & Technologies Used

  • Speech Models: Whisper (Base & Medium), Faster-Whisper
  • Summarization: BART (facebook/bart-large-cnn), Sumy (Luhn)
  • Web Interface: Streamlit
  • Preprocessing: FFmpeg
  • Programming Language: Python
  • Others: ONNX Runtime, Transformers
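
As a minimal sketch of the preprocessing step, the snippet below builds an FFmpeg command that extracts mono 16 kHz WAV audio from a video file, which is the sample rate Whisper models work with. The function names build_ffmpeg_command and extract_audio, the file names, and the choice of 16-bit PCM output are illustrative assumptions, not the project's actual code.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(video_path: str, wav_path: str, sample_rate: int = 16000) -> list[str]:
    """Return an FFmpeg argument list that extracts mono PCM audio from a video."""
    return [
        "ffmpeg",
        "-y",                      # overwrite the output file if it already exists
        "-i", video_path,          # input video or audio file
        "-vn",                     # drop the video stream
        "-ac", "1",                # downmix to a single channel
        "-ar", str(sample_rate),   # resample to the target rate
        "-c:a", "pcm_s16le",       # encode as 16-bit PCM (WAV)
        wav_path,
    ]


def extract_audio(video_path: str, wav_path: str) -> Path:
    """Run FFmpeg (must be on PATH) and return the path of the extracted WAV file."""
    subprocess.run(build_ffmpeg_command(video_path, wav_path), check=True)
    return Path(wav_path)
```

Keeping the command builder separate from the subprocess call makes the arguments easy to inspect and test without running FFmpeg.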

Model Comparison: Features at a Glance

Model Accuracy Offline Multilingual Ease of Use Cost
Whisper ✅✅✅ ✅✅✅ ✅✅ Free
Wav2Vec 2.0 ✅✅ ⚠️ (mostly English) Free
Google API ✅✅✅ ✅✅✅ ✅✅✅ Paid
DeepSpeech ✅✅ Free
Kaldi ✅✅ ✅ (with effort) ⚠️ Complex Free
Vosk ✅✅ ✅✅ ✅✅✅ Free

Limitations

  • Every machine learning project has constraints.
  • This section highlights key model, summarization, and deployment limitations.
  • Stating them defines the boundaries for interpreting results and guides future improvements.

Model-Level Limitations: Whisper

  • Hardware Requirements: Large Whisper models need GPUs; CPU inference is slow.
  • Inconsistent Formatting: Output sometimes lacks punctuation and proper sentence segmentation.
  • Limited Domain Understanding: Struggles with mathematical or technical language.
  • Accent and Noise Sensitivity: Accuracy drops slightly with strong accents or noise.
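
One common way to soften the hardware constraint is to run a smaller model with int8 quantization on CPU. The sketch below assumes the faster-whisper package is installed and uses its WhisperModel class; the helper names, the "base" model size, and the beam size are assumptions chosen for illustration, not the project's recorded settings.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as HH:MM:SS (fractions are dropped)."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def transcribe(wav_path: str, model_size: str = "base") -> list[str]:
    """Transcribe an audio file and return one timestamped line per segment."""
    # Imported here so this module still loads when faster-whisper is not installed.
    from faster_whisper import WhisperModel

    # int8 quantization lowers memory use and is typically faster on CPU.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(wav_path, beam_size=5)
    return [
        f"[{format_timestamp(seg.start)} - {format_timestamp(seg.end)}] {seg.text.strip()}"
        for seg in segments
    ]
```

The per-segment timestamps are what make a transcript searchable: a keyword match in the text can be mapped back to a position in the video.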

Model-Level Limitations: BART

  • Hallucinations: May invent information not in the transcript.
  • Chunking Issues: Large inputs must be split, hurting context retention.
  • Factual Inaccuracy: Sometimes combines unrelated ideas.
  • Weak with Technical Content: Fails to handle formulas or structured arguments.
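
The chunking issue can be illustrated with a short sketch: the transcript is split on sentence boundaries into chunks under a word budget, each chunk is summarized separately, and the partial summaries are joined. This is one simple strategy under stated assumptions, not necessarily the one used in the project. The 400-word budget, the function names, and the generation lengths are hypothetical; the BART wrapper assumes the transformers package is installed.

```python
import re
from typing import Callable


def split_into_chunks(text: str, max_words: int = 400) -> list[str]:
    """Group sentences into chunks of at most max_words words.

    A single sentence longer than max_words becomes its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


def summarize_long(text: str, summarize: Callable[[str], str], max_words: int = 400) -> str:
    """Summarize each chunk independently and join the partial summaries."""
    return " ".join(summarize(chunk) for chunk in split_into_chunks(text, max_words))


def bart_summarizer() -> Callable[[str], str]:
    """Return a chunk summarizer backed by facebook/bart-large-cnn."""
    # Imported here so the chunking helpers work without transformers installed.
    from transformers import pipeline

    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

    def summarize(chunk: str) -> str:
        result = summarizer(chunk, max_length=130, min_length=30, do_sample=False)
        return result[0]["summary_text"]

    return summarize
```

Because each chunk is summarized without seeing the others, references that cross a chunk boundary are lost, which is the context-retention cost noted above. Note also that the sentence splitter relies on punctuation, so it degrades on transcripts that lack it.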

Project-Level Limitations

  • English-Only Evaluation: Indian languages such as Hindi and Marathi were not tested.
  • Slow Inference: Large models cause delays even with a GPU.
  • No Real-Time Support: Pipeline is batch-based, not live.
  • Minimal Post-Editing: Output not cleaned for publication.

Future Scope

  • Fine-tune models for domain-specific use (math, stats, etc.)
  • Add punctuation restoration and sentence boundary detection
  • Evaluate on regional languages and accents
  • Explore lightweight or multilingual alternatives (e.g., DistilBART as a lightweight option)
  • Research math-aware or rule-based summarizers

Key Learnings

  • Gained hands-on experience with speech-to-text pipelines
  • Understood limitations of transformer-based summarizers
  • Built a real-world app using Streamlit and ONNX
  • Improved skills in modular Python design and lazy loading
  • Learned how to evaluate model fit for domain-specific data
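
The lazy-loading idea mentioned above can be sketched with the standard library: a model is constructed only on first request and reused afterwards, so the app starts quickly and never loads a model the user did not select. The get_model function and its placeholder return value are hypothetical; a real app would construct the Whisper or BART model at that point.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def get_model(name: str) -> dict:
    """Load the named model on first request and return the cached object afterwards."""
    # Hypothetical placeholder for an expensive load, such as reading model weights.
    return {"name": name}


first = get_model("base")
second = get_model("base")   # served from the cache, no second load
assert first is second
```

In a Streamlit app, the st.cache_resource decorator plays a similar role by keeping one loaded model across reruns of the script.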

App

Conclusion

  • Built an end-to-end speech-to-text and summarization pipeline.
  • Leveraged powerful models like Whisper and BART for high-quality transcription and summarization.
  • Explored alternative models and toolkits for varied use cases (offline, multilingual, edge devices).
  • Deployed the solution using Streamlit to provide a user-friendly interface.
  • Identified limitations with respect to summarizing technical or mathematical content.
  • Gained practical experience with NLP tools, model selection, and pipeline integration.

Acknowledgements

  • Dr. Dinesh Helwade for guidance and support
  • Purva Puranik and Pranav Ransing for technical assistance
  • Pradnyaa InfoVision for providing the platform and resources
  • Open-source community for tools like Whisper, BART, and Streamlit

Thank You!

  • I appreciate your time and attention.
  • Looking forward to your questions and feedback.

Scan the QR code to view the presentation: